Introduction

Welcome to my 2024 Advanced Next Generation Sequencing Data project. In this study, I investigate changes in the transcriptome of the chicken retina and cornea.

The retina and cornea are vital components of avian vision, influencing how chickens perceive their surroundings. Understanding the genetic mechanisms underlying their development and function is crucial for various fields, including developmental biology, ophthalmology, and evolutionary genetics.

Using advanced sequencing techniques, I analyze the gene expression patterns in these ocular tissues to uncover the genetic factors shaping their development and maintenance. By examining the chicken genome, I aim to elucidate the molecular processes driving the formation and function of the avian visual system.

Question of interest

Understanding the molecular mechanisms underlying retinal and corneal development is crucial for elucidating the pathogenesis of various ocular diseases and developing effective therapeutic interventions. As highly conserved processes among vertebrates, studying these mechanisms in model organisms like the chicken provides valuable insights into human ocular development and pathology. Specifically, investigating the expression patterns of HOX and Homeobox genes in the chicken embryo retina and cornea across different developmental stages can shed light on the intricate regulatory networks governing ocular morphogenesis. By delineating these patterns and their significance, we can not only enhance our understanding of ocular development but also establish the chicken as a robust model organism for human genomic studies in the ocular region

Hypothesis

HOX and Homeobox genes are required for specification and differentiation of retinal cells. The expression patterns of Hox and Homeobox genes change over the course of chicken retinal development, with significant shifts anticipated at key stages such as E8, E16, and E18, when whole retinas were harvested. Additionally, at E18, when whole corneas were collected alongside retinas, distinct gene expression profiles may emerge, reflecting the coordinated orchestration of retinal and corneal development processes

Results

Key insights

The DEG analysis unveiled a significant number of genes for each comparison: 6619 for E16 vs E8, 6974 for E18 vs E8, and 8201 for cornea vs retina, all exhibiting a |log2FoldChange| > 1 threshold. Noteworthy Gene Ontology (GO) terms in all three analyses suggest a profound involvement of these genes in various biological processes. For instance, terms such as “multicellular organism development,” “system development,” and “nervous system development” emerged as prominent themes. These findings underscore the critical role of the identified genes in orchestrating complex developmental processes, including cellular differentiation, organogenesis, and neurodevelopment, within the ocular tissues.

Limitations

One notable limitation is the variability in read lengths across samples, potentially introducing bias, particularly in samples with shorter reads. This observation is displayed in the table below.

sample name condition region read length
E8_retina_rep1 E8 retina 76
E8_retina_rep2 E8 retina 76
E16_retina_rep1 E16 retina 150
E16_retina_rep2 E16 retina 150
E18_retina_rep1 E18 retina 76
E18_retina_rep2 E18 retina 76
E18_cornea_rep1 E18 cornea 147
E18_cornea_rep2 E18 cornea 147

Additionally, the narrow window of embryonic developmental stages observed in this study—E8, E16, and E18 for retina tissue, and E18 for cornea—may limit the comprehensive understanding of gene expression dynamics throughout all developmental stages. While focusing on these stages may overlook potential differences present in other developmental phases, the analysis aims to provide insights into early, middle, and late retina development.

Future experiments and analyses

In future endeavors, incorporating single-cell RNA sequencing (scRNA-seq) data could enrich our understanding of cellular heterogeneity and gene expression patterns at a finer resolution, enabling the identification of cell-specific regulatory networks and molecular signatures underlying retinal and corneal development. Additionally, integrating spatial transcriptomics techniques could elucidate the spatial organization of gene expression within ocular tissues, providing spatial context to the observed expression dynamics.

Method

1. Data record

My project was inspired by the paper entilted, RNA sequencing analysis of the developing chicken retina which was published in Scientific Data in 2016. The transcriptomic data discussed in the paper was deposited in NCBI Gene Expression Omnibus (GEO) under the accession number GSE65938. In short, the data was generated by James Madison University’s Center for Genome & Metagenome Studies (CGEMS) and the experiments were conducted with approval from the James Madison University Institutional Animal Care and Use Committee and adhered to the guidelines outlined in the National Institutes of Health guide for the care and use of laboratory animals. RNA extraction was performed from eight embryonic chicken ocular tissues, including whole retinas from E8, E16, and E18 developing chicken embryos, as well as whole corneas collected from E18 embryos. Duplicates were obtained for each time point, and total RNA was extracted using a Qiagen AllPrep Mini Kit, following the manufacturer’s instructions. cDNA libraries were prepared using a poly dT primer, thus the data set is representative of only polyadenylated mRNA transcripts and does not represent non-coding RNA or other non-polyadenylated cellular transcripts. Finally, Illumina Next Generation Sequencing (NGS) technology was utilized for sequencing the RNA samples, providing high-throughput and accurate transcriptomic data acquisition.

2. Begining to download the data

downloading script is located in the following directory, or in the scripts folder in GitHub.

To acquire the data, I began by storing all the accession numbers from GEO into a file using the SRA run selector (accession_list.txt). This step allowed me to execute my script for downloading the data. In total, I obtained 8 samples that represent various embryonic time points from both the chicken’s retina and cornea.

(angsd) bash-4.4$ nano accession_list.txt
SRR1804235
SRR1804236
SRR1804237
SRR1804238
SRR1804239
SRR1804240
SRR1804241
SRR1804242

The following is my downloading script

(angsd) bash-4.4$ nano sra_download.sh 
(angsd) bash-4.4$ chmod u+x sra_download.sh 

#!/bin/bash
accession_num=$(cat accession_list.txt)

# make new directory to hold fastq files
#mkdir fastq
#cd fastq

# Read accession numbers from the file and iterate over them
while read -r accession_num; do
    echo "$accession_num"  # Print each accession number before downloading
    fasterq-dump "$accession_num" -F fastq -O fastq --skip-technical --split-3 --threads 3
done < accession_list.txt

echo "All downloads complete :)"

3. Quality Control: fastQC and MultiQC

running fastQC and multiQC on my raw files script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/fastq/fastqc_multiqc.sh, or in the scripts folder in my GitHub repository.

To assess the quality of the downloaded samples, I developed a script that executes both fastQC and multiQC on each of my samples.

for file in *_*.fastq; do fastqc "$file"; done 

multiqc *_fastqc.zip

The multiQC report indicates low adapter content across all my samples. Therefore, I made the decision not to trim my reads.

4. Alignment using STAR

genome generate script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/genome/STAR_genomeGenerate.sh, or in the scripts folder in my GitHub repository.

Before aligning my reads, I downloaded the chicken genome galGal6 and indexed it. While my paper originally utilized galGal4, which was likely the latest version available at the time, the authors mentioned that galGal6 would also be compatible. Considering that galGal6 likely offers a more comprehensive genome than galGal4, and since the paper explicitly stated that galGal6 would suffice, I opted to utilize galGal6 for my analysis.

STAR --runMode genomeGenerate \
     --runThreadN 5 \
     --genomeDir galGal6_STAR_Index \
     --genomeFastaFiles galGal6.fa \
     --sjdbGTFfile galGal6.ncbiRefSeq.gtf \
     --sjdbOverhang 99 \

–runThreadN 5: Number of threads

–runMode genomeGenerate: This tells STAR to run in genome indexing mode.

–genomeDir “galGal6_STAR_Index”: Specifies the directory where the generated genome indices will be stored.

–genomeFastaFiles “galGal6.fa”: Specifies the path to the FASTA files containing the reference genome sequences.

–sjdbGTFfile “galGal6.ncbiRefSeq.gtf”: Specifies the path to the GTF file containing the genome annotation.

–sjdbOverhang 99: This parameter sets the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database.

Running STAR using my script:

alignment script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/Alignment/STAR_align.sh, or in the scripts folder in my GitHub repository.

#!/bin/bash

# E8 retina rep 1 Align reads

STAR --runMode alignReads \
     --outFilterMultimapNmax 20 \
     --runThreadN 8 \
     --outFileNamePrefix E8_retina_rep1_ \
     --genomeDir ../genome/galGal6_STAR_Index \
     --readFilesIn ../fastq/SRR1804237_1.fastq.gz ../fastq/SRR1804237_2.fastq.gz \
     --readFilesCommand zcat \
     --outFilterType BySJout \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes NH HI AS nM MD


# E8 retina rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E8_retina_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804238_1.fastq.gz ../fastq/SRR1804238_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E16 retina rep 1 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E16_retina_rep1_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804235_1.fastq.gz ../fastq/SRR1804235_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E16 retina rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E16_retina_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804236_1.fastq.gz ../fastq/SRR1804236_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E18 retina rep 1 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_retina_rep1_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804239_1.fastq.gz ../fastq/SRR1804239_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD


# E18 retina rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_retina_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804240_1.fastq.gz ../fastq/SRR1804240_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD


# E18 cornea rep 1 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_cornea_rep1_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804241_1.fastq.gz ../fastq/SRR1804241_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E18 cornea rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_cornea_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804242_1.fastq.gz ../fastq/SRR1804242_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

My choices for options to include were influenced by the alignment script that was used in the paper to align the data.

--runMode alignReads: This specifies the operation mode of STAR. alignReads tells STAR to perform the read alignment against the prepared genome index.

--runThreadN 8: Number of threads

--genomeDir "galGal6_STAR_Index": This sets the path to the directory containing the genome index files that were previously generated with STAR in genomeGenerate mode.

--readFilesIn "$sample.fastq.gz": Specifies the input files containing the RNAseq reads.

--outFileNamePrefix: This parameter sets the prefix for all output files generated by STAR.

--outSAMtype BAM SortedByCoordinate: This tells STAR to output the alignment results in BAM format, sorted by genomic coordinates.

--outFilterMultimapNmax: maximum number of loci the read is allowed to map to.

--outSAMattributes: specify SAMtags

outFilterType: BySJout: keep only those reads that contain junctions that passed filtering into SJ.out.tab

After this, I ran bamQC on my alignment files. Below, I show one of those reports for one sample.

bamQC script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/Alignment/align_qc.sh , or in the scripts folder in my GitHub repository.

#!/bin/bash
for file in *.bam; do
        name=$(echo "$file" | sed "s/_Aligned.sortedByCoord.out.bam//")

        if [[ ! -f "$name.report.pdf" ]]; then
                qualimap bamqc -bam $file -outformat pdf -outfile "$name.report.pdf" -outdir /athena/angsd/scratch/shp4022/project/Alignment/Alignment_qc &
        fi
done

wait
knitr::include_graphics("./E16_retina_rep1.report.pdf")

I also executed multiQC on my final.out files to generate a report detailing the mapping status of my reads, including information on mapped, unmapped, and multimapped reads, among others.

5. Feature Counts

feature counts script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/Alignment/fc.sh , or in the scripts folder in my GitHub repository.

Next, I will execute featureCounts on my samples to evaluate gene mapping and conduct downstream analyses such as differential gene expression (DGE). The following is my featureCounts script.

#!/bin/sh

featureCounts -F GTF -g gene_id -t exon -T 32 -a ../genome/galGal6.ncbiRefSeq.gtf \
        -o ../featureCount_results/featureCounts.txt -s 2 -p --countReadPairs -B *_Aligned.sortedByCoord.out.bam

My choices for options include:

--F "GTF": Specify format of the provided annotation file.

--g "gene_id": Specify attribute type in GTF annotation

--t "exon": Specify feature type(s) in a GTF annotation.

--T "32": Number of threads

--a "galGal6.ncbiRefSeq.gtf": Name of an annotation file.

--o "../featureCount_results/featureCounts.txt": Name of output file including read counts

--s "2": Perform strand-specific read counting. ‘2’ (reverse stranded)

--p: If specified, libraries are assumed to contain paired-end reads.

--B: Only count read pairs that have both ends aligned.

--countReadPairs : fragments (or templates) will be counted instead of reads

6. Processing Feature Counts in Rstudio

Following the completion of featureCounts, I retrieved the resulting files from the analysis to incorporate into my downstream workflow. Below is the setup for my analysis, where I have listed all the required packages:

suppressMessages(library(ggplot2))
suppressMessages(library(tidyverse))
suppressMessages(library(dplyr))
suppressMessages(library(plotly))
suppressMessages(library(DESeq2))
## Warning: package 'DESeq2' was built under R version 4.3.3
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
suppressMessages(library(DT))
suppressMessages(library(magrittr))
suppressMessages(library(EnhancedVolcano))
suppressMessages(library(pheatmap))
suppressMessages(library(clusterProfiler))
## Warning: package 'clusterProfiler' was built under R version 4.3.3
suppressMessages(library(enrichplot))
suppressMessages(library(DOSE))
suppressMessages(library(org.Gg.eg.db))

Here is a comprehensive list detailing all packages, dependencies, and their respective versions utilized in this project.

Version Control:

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Gg.eg.db_3.18.0         AnnotationDbi_1.64.1       
##  [3] DOSE_3.28.2                 enrichplot_1.22.0          
##  [5] clusterProfiler_4.10.1      pheatmap_1.0.12            
##  [7] EnhancedVolcano_1.20.0      ggrepel_0.9.5              
##  [9] magrittr_2.0.3              DT_0.33                    
## [11] DESeq2_1.42.1               SummarizedExperiment_1.32.0
## [13] Biobase_2.62.0              MatrixGenerics_1.14.0      
## [15] matrixStats_1.3.0           GenomicRanges_1.54.1       
## [17] GenomeInfoDb_1.38.8         IRanges_2.36.0             
## [19] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [21] plotly_4.10.4               lubridate_1.9.3            
## [23] forcats_1.0.0               stringr_1.5.1              
## [25] dplyr_1.1.4                 purrr_1.0.2                
## [27] readr_2.1.5                 tidyr_1.3.1                
## [29] tibble_3.2.1                tidyverse_2.0.0            
## [31] ggplot2_3.5.0              
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] farver_2.1.1            rmarkdown_2.26          fs_1.6.3               
##   [7] zlibbioc_1.48.2         vctrs_0.6.5             memoise_2.0.1          
##  [10] RCurl_1.98-1.14         ggtree_3.10.1           htmltools_0.5.8.1      
##  [13] S4Arrays_1.2.1          SparseArray_1.2.4       gridGraphics_0.5-1     
##  [16] sass_0.4.9              bslib_0.7.0             htmlwidgets_1.6.4      
##  [19] plyr_1.8.9              cachem_1.0.8            igraph_2.0.3           
##  [22] lifecycle_1.0.4         pkgconfig_2.0.3         gson_0.1.0             
##  [25] Matrix_1.6-5            R6_2.5.1                fastmap_1.1.1          
##  [28] GenomeInfoDbData_1.2.11 digest_0.6.35           aplot_0.2.2            
##  [31] colorspace_2.1-0        patchwork_1.2.0.9000    RSQLite_2.3.6          
##  [34] fansi_1.0.6             timechange_0.3.0        httr_1.4.7             
##  [37] polyclip_1.10-6         abind_1.4-5             compiler_4.3.2         
##  [40] bit64_4.0.5             withr_3.0.0             BiocParallel_1.36.0    
##  [43] viridis_0.6.5           DBI_1.2.2               highr_0.10             
##  [46] ggforce_0.4.2           MASS_7.3-60.0.1         DelayedArray_0.28.0    
##  [49] HDO.db_0.99.1           tools_4.3.2             scatterpie_0.2.2       
##  [52] ape_5.8                 glue_1.7.0              nlme_3.1-164           
##  [55] GOSemSim_2.28.1         shadowtext_0.1.3        grid_4.3.2             
##  [58] reshape2_1.4.4          fgsea_1.28.0            generics_0.1.3         
##  [61] gtable_0.3.4            tzdb_0.4.0              data.table_1.15.4      
##  [64] hms_1.1.3               tidygraph_1.3.1         utf8_1.2.4             
##  [67] XVector_0.42.0          pillar_1.9.0            yulab.utils_0.1.4      
##  [70] splines_4.3.2           tweenr_2.0.3            treeio_1.26.0          
##  [73] lattice_0.22-6          bit_4.0.5               tidyselect_1.2.1       
##  [76] GO.db_3.18.0            locfit_1.5-9.9          Biostrings_2.70.3      
##  [79] knitr_1.45              gridExtra_2.3           xfun_0.43              
##  [82] graphlayouts_1.1.1      stringi_1.8.3           lazyeval_0.2.2         
##  [85] ggfun_0.1.4             yaml_2.3.8              evaluate_0.23          
##  [88] codetools_0.2-20        ggraph_2.2.1            qvalue_2.34.0          
##  [91] ggplotify_0.1.2         cli_3.6.2               munsell_0.5.1          
##  [94] jquerylib_0.1.4         Rcpp_1.0.12             png_0.1-8              
##  [97] parallel_4.3.2          blob_1.2.4              bitops_1.0-7           
## [100] tidytree_0.4.6          viridisLite_0.4.2       scales_1.3.0           
## [103] crayon_1.5.2            rlang_1.1.3             cowplot_1.1.3          
## [106] fastmatch_1.1-4         KEGGREST_1.42.0

This summary provides a snapshot of the environment and tools utilized, ensuring transparency and reproducibility in the project.

The commands below facilitate the generation of a read count table based on the featureCounts data. Subsequently, I will make some modifications to enhance readability.

#setwd("/Users/shaunp/Desktop/Weill_Cornell_Graduate/Grad_School/SPRING_2024/Analysis_Next-Gen_Sequencing_Data/Project")
read_counts <- read.table(file = "featureCounts.txt.summary", 
                          header = TRUE)
read_counts %>% 
  head(5) %>%
  DT::datatable()
read_counts %>%
  dplyr::slice(c(1, 9, 12, 14)) -> read_counts

names(read_counts) %>%
  stringr::str_remove("_Aligned.sortedByCoord.out.bam") -> names(read_counts)

names(read_counts) %>%
  stringr::str_extract("\\d$") %>%
  #as.character() %>%
  stringr::str_replace_na(replacement = "rep#") -> rep_number
  #c("rep#") %>%
  #.[1:10] %>%
  #stringr::str_c("rep#")
  #append(values = "rep#", before = "1") -> rep_number

# names(read_counts) |>
#   stringr::str_remove("_rep\\d") -> names(read_counts)
  
# read_counts %>%
#   rbind(rep_number) -> read_counts

read_counts %>%
  dplyr::relocate(E8_retina_rep1, .before = E16_retina_rep1) %>%
  dplyr::relocate(E8_retina_rep2, .after = E8_retina_rep1) %>%
  dplyr::relocate(E18_retina_rep1, .before = E18_cornea_rep1) %>%
  dplyr::relocate(E18_retina_rep2, .after = E18_retina_rep1) -> read_counts


read_counts %>%
  head(5) %>%
  DT::datatable()

Now, I will generate a count plot illustrating the breakdown of counts, including fragments that are assigned, unassigned_ambiguity, unassigned_multimapping, and unassigned_noFeatures for each sample.

read_counts_long <- pivot_longer(read_counts, cols = -Status, names_to = "Sample", values_to = "Count")

# read_counts_long |>
#   ggplot(aes(x = Sample, y = Count, fill = Status)) +
#   geom_bar(stat = "identity", position = "dodge") +
#   #scale_y_log10() +
#   labs(
#     title = "Feature Counts for Embryo retina/cornea development", 
#     x = "Sample", 
#     y = "Count") +
#   coord_flip() -> plot
read_counts_long |>
ggplot(aes(y = Sample, x = Count, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge", orientation = "y") +
  theme_minimal() +
  labs(title = "Counts for Each Replicate and Status", y = "Replicate", x = "Count") -> plot

plot %>%
  ggplotly()

The count graph illustrates that each replicate displays a similar pattern, with the majority falling into the “assigned” category. This suggests that the alignment and feature count steps were performed optimally.

Notably, “E8_rep1_retina” shows much higher assignment than other replicates, suggesting potential differences in treatment or other factors influencing its count distribution.

7. Quality Control

Following the visualization of my read count data, I proceeded to conduct quality control (QC) on my featureCounts.txt data file. Below are the commands I utilized to refine my data for enhanced readability, perform QC, and set up my DESEq object.

## reading in featureCounts output
fc_counts <- read.table("featureCounts.txt" , header = TRUE)
names(fc_counts) %>%
  stringr::str_remove("_Aligned.sortedByCoord.out.bam") -> names(fc_counts)
fc_counts %>%
  dplyr::select(1, 7:14) %>%
  dplyr::relocate(E8_retina_rep1, .before = E16_retina_rep1) %>%
  dplyr::relocate(E8_retina_rep2, .after = E8_retina_rep1) %>%
  dplyr::relocate(E18_retina_rep1, .before = E18_cornea_rep1) %>%
  dplyr::relocate(E18_retina_rep2, .after = E18_retina_rep1) -> fc_counts
row.names(fc_counts) <- make.names(fc_counts$Geneid)
# fc_counts %>%
#   row.names <- make.names(pull(Geneid))
fc_counts$Geneid <- NULL

fc_counts %>%
  as.matrix -> fc_counts_mtx

fc_counts_mtx %>%
  head(5) %>%
  DT::datatable()
fc_counts %>%
  names %>%
  stringr::str_extract(pattern = "^E\\d+") %>%
  data.frame(condition = ., row.names = colnames(fc_counts_mtx)) -> df_coldata

# fc_counts %>%
#   names %>%
#   #stringr::str_extract(pattern = "^_")
#   stringr::str_split(pattern = "_", n = 3) 

fc_counts %>% 
  names %>% 
  stringr::str_extract(pattern = "_[a-z]+_") %>%
  stringr::str_remove_all(pattern = "_") ->
  df_coldata$region

df_coldata %>%
  #print %>%
  DT::datatable()

Here, I added the median read length to my df_coldata object as a factor. This allows for tracking of sample read lengths for easier reference.

df_coldata %>%
  mutate(readlength=factor(c(76,76,150,150,76,76,147,147)))
##                 condition region readlength
## E8_retina_rep1         E8 retina         76
## E8_retina_rep2         E8 retina         76
## E16_retina_rep1       E16 retina        150
## E16_retina_rep2       E16 retina        150
## E18_retina_rep1       E18 retina         76
## E18_retina_rep2       E18 retina         76
## E18_cornea_rep1       E18 cornea        147
## E18_cornea_rep2       E18 cornea        147

Finally, after manipulating my feature count object, I filtered out rows with zero counts and converted it to a matrix to set up my DESeq object.

fc_counts %>%
  dplyr::filter(rowSums(.) != 0) -> fc_counts

fc_counts %>%
  as.matrix() -> fc_counts_mtx
  
# fc_counts_mtx %>%
#   factor(levels = c('76 bp', '76 bp', '150 bp', '150 bp', '76 bp', '76 bp', '150 bp',
#                     '150 bp', '147 bp', '147 bp'), 
#          labels = c('read length E8_retina_1', 'read length E8_retina_2', 'read length E16_retina_1', 
#                     'read length E16_retina_2', 'read length E18_retina_1', 'read length E18_retina_2',
#                     'read length E18_cornea_1', 'read length E18_cornea_2'))

Setting up DESeq object:

This code chunk creates a DESeqDataSet object (dds) from the count data matrix (fc_counts_mtx), column data (df_coldata), and a design formula specifying the experimental design (~ condition + region). The resulting object dds is displayed.

dds <- DESeqDataSetFromMatrix(countData = fc_counts_mtx,
                       colData = df_coldata,
                       design = ~ condition + region)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet 
## dim: 21225 8 
## metadata(1): version
## assays(1): counts
## rownames(21225): LOC107050604 LOC425607 ... LOC107049475 LOC112532827
## rowData names(0):
## colnames(8): E8_retina_rep1 E8_retina_rep2 ... E18_cornea_rep1
##   E18_cornea_rep2
## colData names(2): condition region

In this chunk, the first eight columns of the feature count data (fc_counts) are extracted and assigned to df_rowdata. Then, the row data of the DESeqDataSet object dds is set to df_rowdata. The modified dds object is displayed.

fc_counts[,1:8] -> df_rowdata
rowData(dds) <- df_rowdata

dds
## class: DESeqDataSet 
## dim: 21225 8 
## metadata(1): version
## assays(1): counts
## rownames(21225): LOC107050604 LOC425607 ... LOC107049475 LOC112532827
## rowData names(8): E8_retina_rep1 E8_retina_rep2 ... E18_cornea_rep1
##   E18_cornea_rep2
## colnames(8): E8_retina_rep1 E8_retina_rep2 ... E18_cornea_rep1
##   E18_cornea_rep2
## colData names(2): condition region
assay(dds, "counts") %>%
  head %>%
  DT::datatable()

Here, counts are extracted from the DESeqDataSet object dds using the assay() function, specifying “counts” as the assay type. The first few rows of the counts data are displayed as a datatable.

This code below calculates the column sums of counts from the DESeqDataSet object dds using counts(), then creates a barplot of the sums.

colSums(counts(dds)) %>% 
  barplot(main = "Gene Count Plot for dds", xlab = "Replicates", ylab = "Gene Counts", col = "lightblue", 
          sub = "(Figure 1) This plot displays the gene counts for all replicates in the analysis after removing genes with no expression.")

After performing this step, I normalized the dds object to ensure better comparability among samples, which is crucial for a more robust DESeq analysis later on. As observed in the plot, some replicates have a greater number of represented genes compared to others.

dds %>%
  estimateSizeFactors() -> dds
dds_norm_counts <- counts(dds, normalized = TRUE)

par(mfrow=c(1,2)) # to plot the two box plots next to each other
## bp of non-normalized
boxplot(counts(dds), 
        main = "Non-Normalized Read Counts", 
        xlab = "Sample", 
        ylab = "Counts",
        sub = "(Figure 2) Non-normalized read counts exhibit considerable variability.") 


boxplot(dds_norm_counts, 
        main = "Normalized Read Counts", 
        xlab = "Sample", 
        ylab = "Counts",
        sub = "(Figure 3) Normalized read counts enhance comparability across samples by standardizing the underlying distribution.")
dds %>%
  estimateSizeFactors() -> dds
dds_norm_counts <- counts(dds, normalized = TRUE)

par(mfrow=c(1,2)) # to plot the two box plots next to each other
## bp of non-normalized
boxplot(log2(counts(dds) + 1), 
        notch=TRUE,
        main = "Non-normalized Read Counts",
        xlab = "Sample",
        ylab = "log2(read counts)", 
        cex = 0.6,
        sub =  "(Figure 2) Non-normalized read counts exhibit considerable variability.")

boxplot(log2(counts(dds, normalize = TRUE) + 1), 
        notch=TRUE,
        main = "Normalized Read Counts",
        xlab = "Sample",
        ylab = "log2(read counts)", 
        cex = 0.6,
        sub = "(Figure 3) Normalized read counts enhance comparability across samples by standardizing the underlying distribution.")

8. Principle Component Analysis (PCA)

Before performing PCA, I applied the rlogTransformation to normalize my data.

dds %>%
  rlogTransformation %>%
  plotPCA(intgroup = c("condition", "region")) + labs(
    title = "PCA", 
    subtitle = "(Figure 4) PCA plot of my dds object"
  ) 
## using ntop=500 top features by variance

PC1 explains 86% of the variance in my data. Upon examining the graph, it appears that PC1 has effectively separated my samples based on the optical region. This observation aligns with the expectation that samples from different regions would cluster separately. In essence, it is reasonable to assume that the primary source of variation among my samples is the region from which they were extracted, specifically, the retina and cornea of a chicken embryo.

PC2 explains 13% of the variance, which appears to be associated with embryonic age. Given that E8 represents the earliest time point, it is not surprising that it does not cluster with E16 or E18. This suggests that E8, which I intend to use as my control, exhibits distinct characteristics compared to other time points, regardless of their origin from retina or cornea cells.

This finding holds promise for differential gene expression analysis, indicating that the time points before and after E8 have more significant differences, potentially attributable to developmental events. Since developmental differences are apparent in this PCA, I have high hopes that the developmental genes I am focusing on, such as HOX/Homeobox genes, would yield promising insights.

Correlation heatmaps

The purpose of plotting a correlation heatmap was to explore the similarities between my samples: E8_retina_rep1, E8_retina_rep2, E16_retina_rep1, E16_retina_rep2, E18_retina_rep1, E18_retina_rep2, E18_cornea_rep1, and E18_cornea_rep2.

Using pcaExplorer, I generated the correlation plot, and the results are presented here. As expected, the same samples show perfect correlation with themselves. Replicates of the same sample are highly correlated, as anticipated. In contrast, retina and cornea region samples exhibit little to no correlation, which aligns with their status as distinct eye regions. Additionally, the embryonic time points correlate in proportion to their temporal proximity within the retina. For instance, E8 shows a stronger correlation with E16 than with E18. This pattern mirrors the findings from my PCA plot, but presented here as a heatmap. The analysis was based on normalized counts

This code snippet utilizes pcaExplorer to analyze the dds object with the transformed data dst_rlog. The resulting visualization is presented below:

#pcaExplorer::pcaExplorer(dds=dds, dst=dst_rlog)

(Figure 5): Correlation plot of normalized data.

9. Differentally Expressed Genes (DEGs)

The primary objective of this analysis was to examine the expression patterns of HOX and Homeobox genes, which play a crucial role in the specification and differentiation of retinal and corneal cells. These expression patterns are expected to shift during key stages of chicken retinal development, particularly at E8, E16, and E18, when whole retinas were collected. Furthermore, at E18, whole corneas were also collected, offering insights into distinct gene expression profiles and their impact on the coordinated development of retinal and corneal tissues.

I focused on all 39 HOX genes and a targeted subset of Homeobox genes known to be expressed in the retina, as identified in the paper, The role of homeobox genes in retinal development and disease.

The analysis began with determining the number of significant genes, which was then narrowed down to identify the expression of HOX and Homeobox genes specifically.

dds$condition %<>%
  relevel(ref = "E8")
dds$condition
## [1] E8  E8  E16 E16 E18 E18 E18 E18
## Levels: E8 E16 E18
dds$region %<>%
  relevel(ref = "retina")
dds$region
## [1] retina retina retina retina retina retina cornea cornea
## Levels: retina cornea

This code snippet adjusts the levels of the condition and region variables within the dds object. Setting “E8” as the reference level for the condition variable and “retina” as the reference level for the region variable ensures that these specific conditions serve as controls during comparisons with other conditions. This approach aids in effectively exploring the hypothesis under investigation.

dds %<>%
  DESeq()
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Here, I’m running the DESeq analysis (dds %<>% DESeq()) to perform differential gene expression comparisons. This step is crucial for identifying genes that are significantly differentially expressed across experimental conditions.

rowData(dds) %>% 
  colnames
##  [1] "E8_retina_rep1"                       
##  [2] "E8_retina_rep2"                       
##  [3] "E16_retina_rep1"                      
##  [4] "E16_retina_rep2"                      
##  [5] "E18_retina_rep1"                      
##  [6] "E18_retina_rep2"                      
##  [7] "E18_cornea_rep1"                      
##  [8] "E18_cornea_rep2"                      
##  [9] "baseMean"                             
## [10] "baseVar"                              
## [11] "allZero"                              
## [12] "dispGeneEst"                          
## [13] "dispGeneIter"                         
## [14] "dispFit"                              
## [15] "dispersion"                           
## [16] "dispIter"                             
## [17] "dispOutlier"                          
## [18] "dispMAP"                              
## [19] "Intercept"                            
## [20] "condition_E16_vs_E8"                  
## [21] "condition_E18_vs_E8"                  
## [22] "region_cornea_vs_retina"              
## [23] "SE_Intercept"                         
## [24] "SE_condition_E16_vs_E8"               
## [25] "SE_condition_E18_vs_E8"               
## [26] "SE_region_cornea_vs_retina"           
## [27] "WaldStatistic_Intercept"              
## [28] "WaldStatistic_condition_E16_vs_E8"    
## [29] "WaldStatistic_condition_E18_vs_E8"    
## [30] "WaldStatistic_region_cornea_vs_retina"
## [31] "WaldPvalue_Intercept"                 
## [32] "WaldPvalue_condition_E16_vs_E8"       
## [33] "WaldPvalue_condition_E18_vs_E8"       
## [34] "WaldPvalue_region_cornea_vs_retina"   
## [35] "betaConv"                             
## [36] "betaIter"                             
## [37] "deviance"                             
## [38] "maxCooks"

After that, I extract results from the analysis while accounting for multiple comparisons. By setting the significance threshold (alpha) to 0.05 divided by 3 (to adjust for three independent tests), I ensure a rigorous statistical evaluation of the data. This step is important because it helps avoid false positives and ensures that any significant findings are reliable and not merely due to chance.

dds %>%
  results(independentFiltering = TRUE, alpha = 0.05/3, contrast = c("condition","E16", "E8")) -> df_results_E16vE8

dds %>%
  results(independentFiltering = TRUE, alpha = 0.05/3, contrast = c("condition","E18", "E8")) -> df_results_E18vE8

dds %>%
  results(independentFiltering = TRUE, alpha = 0.05/3, contrast = c("region","cornea", "retina")) -> df_results_corneaVretina

After extracting results from my DESeq analysis for all three of my comparisons, I focus on genes that are statistically significant (padj < 0.05). This helps identify genes that are likely to be biologically meaningful in the context of my study.

significant_genes_E16vE8 <- df_results_E16vE8[which(df_results_E16vE8$padj < 0.05), ]
significant_genes_E18vE8 <- df_results_E18vE8[which(df_results_E18vE8$padj < 0.05), ]
significant_genes_corneaVretina <- df_results_corneaVretina[which(df_results_corneaVretina$padj < 0.05), ]
significant_fold_change_genes_E16vE8 <- significant_genes_E16vE8[which(abs(significant_genes_E16vE8$log2FoldChange)>1),]

significant_fold_change_genes_E18vE8 <- significant_genes_E18vE8[which(abs(significant_genes_E18vE8$log2FoldChange)>1),]

significant_fold_change_genes_corneaVretina  <- significant_genes_corneaVretina[which(abs(significant_genes_corneaVretina$log2FoldChange)>1),]

After that, I further refine my list of significant differentially expressed genes by considering only those with an absolute log2FoldChange greater than one. This additional criterion ensures that only genes with substantial changes in expression are included in my analysis, providing more confidence in the biological relevance of the findings.

paste("The DEG analysis revealed that there are are:", length(significant_fold_change_genes_E16vE8@rownames), length(significant_fold_change_genes_E18vE8@rownames), length(significant_fold_change_genes_corneaVretina@rownames), "significant genes for my comparisions, E16 vs E8, E18 vs E8, and cornea vs retina respectively.")
## [1] "The DEG analysis revealed that there are are: 6613 6974 8225 significant genes for my comparisions, E16 vs E8, E18 vs E8, and cornea vs retina respectively."

After further refining my list of differentially expressed genes (DEGs) based on an absolute log2FoldChange greater than 1, I still had over six thousand genes in each comparison group. To streamline the analysis, I’ll focus on the top ten DEGs, as they should effectively represent each comparison.

# significant_fold_change_genes_corneaVretina %>%
#   as.data.frame() %>%
#   dplyr::arrange(desc(abs(log2FoldChange))) %>%
#   head(10) -> top10_corneaVretina

significant_fold_change_genes_corneaVretina %>%
  as.data.frame() %>%
  dplyr::arrange(desc(abs(log2FoldChange))) %>%
  head(10) -> top10_DEGs_corneaVretina


# write.csv(top10_DEGs_corneaVretina,
#           file = "./top10_DEGs_corneaVretina.csv",
#           row.names = TRUE)
top10_DEGs_corneaVretina %>%
  head(3) %>%
  DT::datatable() 

Reviewing the first three genes in this list for the cornea vs retina comparison:

“LYPD2”: This gene is predicted to be located in extracellular region and plasma membrane [4].

“KRT6A”: The protein encoded by this gene is a member of the keratin gene family. The type II cytokeratins consist of basic or neutral proteins which are arranged in pairs of heterotypic keratin chains coexpressed during differentiation of simple and stratified epithelial tissues. As many as six of this type II cytokeratin (KRT6) have been identified; the multiplicity of the genes is attributed to successive gene duplication events [4].

“S100A4”: The protein encoded by this gene is a member of the S100 family of proteins containing 2 EF-hand calcium-binding motifs. S100 proteins are localized in the cytoplasm and/or nucleus of a wide range of cells, and involved in the regulation of a number of cellular processes such as cell cycle progression and differentiation [4].

These genes are associated with developmental processes, aligning well with the origin of my data from chicken embryos. This suggests developmental differences between the cornea and retina.

significant_fold_change_genes_E16vE8 %>%
  as.data.frame() %>%
  dplyr::arrange(desc(abs(log2FoldChange))) %>%
  head(10) -> top10_DEGs_E16vE8

# write.csv(top10_DEGs_E16vE8,
#           file = "./top10_DEGs_E16vE8.csv",
#           row.names = TRUE)
top10_DEGs_E16vE8 %>%
  head(3) %>%
  DT::datatable() 

Reviewing the first three genes in this list for the E16 vs E8 comparison:

“HINTW”: HINTW is a DMRT-1 gene inhibitor during sexual development. HINTW is expressed in multiple tissues of female chicken embryos. It’s been considered a promising candidate gene for female sex determination [5].

“UBE2R2L”: Partial sex-biased gene [6], indicating a role in gender differentiation however, I couldn’t find much about this gene.

“LOC112530494”: Information about this gene is scarce, reflecting a common challenge encountered in this project due to the limited documentation of the chicken genome.

Overall, these genes suggest differential gender differentiation in chicken embryos, which is consistent with the focus of the study. The lack of comprehensive information underscores the need for further research in this area.

significant_fold_change_genes_E18vE8 %>%
  as.data.frame() %>%
  dplyr::arrange(desc(abs(log2FoldChange))) %>%
  head(10) -> top10_DEGs_E18vE8

# write.csv(top10_DEGs_E18vE8,
#           file = "./top10_DEGs_E18vE8.csv",
#           row.names = TRUE)
top10_DEGs_E18vE8 %>%
  head(3) %>%
  DT::datatable() 

Reviewing the first three genes in this list for the E18 vs E8 comparison:

“OPN1MSW”: This gene encodes opsin, a protein sensitive to green light, found in the retina. Its differential expression suggests potential changes in light sensitivity during embryonic development. One conjecture is that its presence in E18, a late-stage retina sample, may indicate the maturity of the retina and ocular region. However, further investigation is needed to confirm this hypothesis.

“HINTW”*: As previously discussed, HINTW plays a role in sexual development and appears among the top DEGs in both the E16 vs E8 and E18 vs E8 comparisons, indicating its importance in these developmental stages.

“UBE2R2L”: Similar to HINTW, UBE2R2L reappears among the top DEGs in this comparison. The recurrence of these genes suggests their potential involvement in common developmental processes between E16 and E18 embryos compared to E8.

In summary, the analysis reveals differential expression of genes associated with light sensitivity and sexual development between E18 and E8 embryos. The recurrence of certain genes across comparisons suggests their significance in developmental processes during these stages.

the files top10_DEGs_corneaVretina.csv, top10_DEGs_E16vE8.csv, top10_DEGs_E18vE8.csv can be found in the “key data sets” folder on my GitHub

HOX Genes

Now, let’s zoom in on HOX genes to delve deeper into my hypothesis. To kick things off, I combed through the list of differentially expressed genes (DEGs) in each of our comparison groups, specifically on the lookout for any HOX genes.

significant_fold_change_genes_corneaVretina %>%
  .@rownames %>%
  stringr::str_starts("HOX") %>%
  which() %>% length()
## [1] 5
significant_fold_change_genes_E16vE8%>%
  .@rownames %>%
  stringr::str_starts("HOX") %>%
  which() %>% length()
## [1] 0
significant_fold_change_genes_E18vE8 %>%
  .@rownames %>%
  stringr::str_starts("HOX") %>%
  which() %>% length()
## [1] 0

From this initial analysis, it’s clear that only 5 HOX genes show differential expression, and they’re solely in the cornea vs retina comparison group. This intriguing finding opens up possibilities for further investigation…

Next, I set out to pinpoint exactly which HOX genes are exhibiting differential expression. Check out the code snippet below:

par(mfrow=c(1,2))

# significant_fold_change_genes_E16vE8 %>%
#   .@rownames %>%
#   stringr::str_starts("HOX") %>%
#   which()
# 
# significant_fold_change_genes_E18vE8 %>%
#   .@rownames %>%
#   stringr::str_starts("HOX") %>%
#   which()

significant_fold_change_genes_corneaVretina %>%
  .@rownames %>%
  stringr::str_starts("HOX") %>%
  which() -> HOX_gene_index

significant_fold_change_genes_corneaVretina %>%
  .@rownames %>%
  .[HOX_gene_index] -> HOX_genes

# HOX_genes %>%
#   sapply(function(x) plotCounts(significant_fold_change_genes_corneaVretina, gene = x,
#                                 normalized = TRUE, xlab = "Embronic time"))

# plotCounts(dds, gene = "HOXB7", intgroup = "region",
#                                 normalized = TRUE, xlab = "Embronic time")
significant_HOX_genes <- paste(HOX_genes, collapse = ", ")

output <- paste(
  "This analysis revealed that HOX genes are only significantly expressed when comparing the optical regions retina and cornea.",
  "Out of the 39 HOX genes, there are only", length(HOX_genes), "that are significant, namely:", significant_HOX_genes
)

output
## [1] "This analysis revealed that HOX genes are only significantly expressed when comparing the optical regions retina and cornea. Out of the 39 HOX genes, there are only 5 that are significant, namely: HOXB7, HOXD10, HOXD9, HOXD8, HOXD3"

To gain insights into the expression patterns of the five differentially expressed HOX genes, I generated plots of normalized counts for each gene across the retina and cornea regions. This visual representation allows for a direct comparison of gene expression levels in these two optical regions, highlighting any significant differences between them.

By examining these plots, we can observe the expression differences of each HOX gene between the retina and cornea, providing valuable insights into the potential functional roles of these genes in the development of these optical regions.

# HOX_genes %>%
#   sapply(function(x) plotCounts(dds, gene = x, intgroup = "region",
#                                 normalized = TRUE, xlab = "Region"))

# for (i in HOX_genes) {
#   png(filename = paste0("./plots/", i, "_plotCounts.png"),  height = 5, width = 5, units = "in", res = 500)
#   plotCounts(dds, gene = i,
#              intgroup = "region",
#              normalized = TRUE,
#              xlab = "Region")
#   dev.off()  # Close the graphics device after plotting
# }
plotCounts(dds, gene="HOXB7", normalized = TRUE, intgroup = "region", 
           xlab="Embronic Time", main="(Figure 6) Expression Pattern of HOXB7 Gene in Different Regions")

plotCounts(dds, gene="HOXD10", normalized = TRUE, intgroup = "region",
           xlab="Embronic Time", main="(Figure 7) Expression Pattern of HOXB7 Gene in Different Regions")

plotCounts(dds, gene="HOXD9", normalized = TRUE, intgroup = "region",
           xlab="Embronic Time", main="(Figure 8) Expression Pattern of HOXB7 Gene in Different Regions")

plotCounts(dds, gene="HOXD8", normalized = TRUE, intgroup = "region",
           xlab="Embronic Time", main="(Figure 9) Expression Pattern of HOXB7 Gene in Different Regions")

plotCounts(dds, gene="HOXD3", normalized = TRUE, intgroup = "region",
           xlab="Embronic Time", main="(Figure 10) Expression Pattern of HOXB7 Gene in Different Regions")

As depicted in the plots, these five HOX genes exhibit higher expression levels in the cornea compared to the retina, suggesting a potential role in corneal development and function.

HOMEOBOX GENES

In this part of the analysis, the expectation is that these Homeobox genes will predominantly be expressed in the retina. Due to this anticipated expression pattern, I conducted the analysis specifically comparing the cornea versus retina differential expressed genes.

retina_homeobox_genes <- c('RAX', 'PAX6', 'LHX1', 'LHX2', 'VSX2', 'MEIS1',
                           'MEIS2', 'VSX1', 'DLX1', 'DLX2', 'DLX5', 'DLX6',
                           'POU4F2', 'POU4F1', 'POU4F3', 'ISL1', 'ISL2', 'ONECUT1',
                           'ONECUT2', 'PROX1', 'IRX5', 'IRX6', 'OTX2', 'CRX')
# retina_homeobox_genes[retina_homeobox_genes %in% significant_fold_change_genes_corneaVretina@rownames] %>%
#   sapply(function(x) plotCounts(dds, gene = x, intgroup = "region",
#                                 normalized = TRUE, xlab = "Region"))
retina_homeobox_genes[retina_homeobox_genes %in% significant_fold_change_genes_corneaVretina@rownames] -> sig_retina_homeobox_genes
  
sig_retina_homeobox_genes <- paste(sig_retina_homeobox_genes, collapse = ", ")

paste("Among this list of genes 24 genes,", retina_homeobox_genes[retina_homeobox_genes %in% significant_fold_change_genes_corneaVretina@rownames] %>% length, "of them were found to be differentially expressed. Those genes include: " , sig_retina_homeobox_genes)
## [1] "Among this list of genes 24 genes, 18 of them were found to be differentially expressed. Those genes include:  RAX, LHX1, LHX2, VSX2, MEIS2, VSX1, DLX5, DLX6, POU4F2, POU4F1, POU4F3, ISL1, ISL2, ONECUT1, ONECUT2, PROX1, IRX6, OTX2"

Upon scrutinizing the count plots for these genes, my initial expectation was confirmed. For instance, the Retina and anterior neural fold homeobox gene (RAX) serves as an exemplar. As anticipated, this gene exhibits downregulation in the cornea.

# example of homeobox gene from that paper being proven to be signifcantly expressed in retina vs cornea
plotCounts(dds, gene = "RAX", intgroup = "region",
                                normalized = TRUE, xlab = "Region",
           main = "(Figure 11) Expression Pattern of RAX in Different Regions") 

plotCounts(dds, gene = "RAX", intgroup = "condition",
                                normalized = TRUE, xlab = "Embronic Time",
           main = "(Figure 12) Expression Pattern of RAX Over Time")

RAX is among the earliest genes expressed in the developing retina. RAX is essential for early specification of eye development as demonstrated by the loss of eye structures in vertebrate models lacking RAX expression [2].

What surprised me in the analysis was the deviation of two Homeobox genes, DLX5 and DLX6, from the expected trend. Both genes were found to be upregulated in the cornea compared to the retina. In addition to this, these genes also display different expression patterns in embryogenesis. They both become downregulated in the E16 and E18 retina group compared to E8.

# DLX5 plots
plotCounts(dds, gene = "DLX5", intgroup = "region",
                                normalized = TRUE, xlab = "Region",
           main = "(Figure 13) Expression Pattern of DLX5 Over Time") 

plotCounts(dds, gene = "DLX5", intgroup = "condition",
                                normalized = TRUE, xlab = "Embronic Time",
           main = "(Figure 14) Expression Pattern of DLX5 Over Time")

# DLX6 plots
plotCounts(dds, gene = "DLX6", intgroup = "region",
                                normalized = TRUE, xlab = "Region",
           main = "(Figure 15) Expression Pattern of DLX6 Over Time") 

plotCounts(dds, gene = "DLX6", intgroup = "condition",
                                normalized = TRUE, xlab = "Embronic Time",
           main = "(Figure 16) Expression Pattern of DLX6 Over Time")

According to Zagozewski et al. (2014), the exact roles of DLX5 and DLX6 in retinogenesis remain unclear, but their unexpected upregulation underscores the complexity of the regulatory network of Homeobox genes.

Another paper entitled, “A Transcriptomic Analysis of Differential Gene Expression during Chick Periocular Neural Crest Differentiation into Corneal Cells”, reports that DLX6…was upregulated during corneal development. These results indicate that the molecular signature of NCC is maintained in the periocular region, and partially during early corneal development.

In summarizing my DEG analysis for HOX and Homeobox genes, it becomes evident that these genes exhibit differential expression patterns across various embryonic time points (E8, E16, and E18) and optical regions (retina and cornea). The results strongly suggest dynamic changes in gene expression throughout embryonic development, highlighting the intricate regulatory mechanisms governing tissue differentiation.

This comprehensive investigation sheds light on the developmental dynamics within the chicken organism, offering valuable insights into the molecular underpinnings of embryonic development. By elucidating the expression patterns of key regulatory genes like HOX and Homeobox genes, this study contributes to our understanding of the intricate processes orchestrating tissue specification and differentiation during early embryogenesis.

Volcano Plots

After concluding my DEG analysis for HOX and Homeobox genes, I will now zoom out and observe differentiation between my three comparisons and categorize those differentially expressed genes using GO analysis. Before then, I will display my differentially expressed genes as a volcano plot for each comparison.

vp1 <- EnhancedVolcano(significant_fold_change_genes_corneaVretina, 
                       lab=rownames(significant_fold_change_genes_corneaVretina), 
                       x = 'log2FoldChange', y = 'padj',
                       encircle = HOX_genes,
                       #shade = HOX_genes,
                       pCutoff = 0.05, 
                       title = "cornea / retina",
                       caption = "(Figure 17) Volcano plot showing the differentially expressed genes (DEGs) in the cornea compared to the retina, with log2FoldChange and adjusted p-value (padj).")
## Warning: One or more p-values is 0. Converting to 10^-1 * current lowest
## non-zero p-value...
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
print(vp1)

vp1 <- EnhancedVolcano(significant_fold_change_genes_E16vE8, 
                       lab = rownames(significant_fold_change_genes_E16vE8), 
                       x = 'log2FoldChange', y ='padj',
                       #encircle = HOX_genes,
                       #shade = HOX_genes,
                       pCutoff = 0.05, 
                       title ="E16 / E8",
                       caption = "(Figure 18) Volcano plot illustrating the differentially expressed genes (DEGs) in the E16 compared to E8.")
## Warning: One or more p-values is 0. Converting to 10^-1 * current lowest
## non-zero p-value...
print(vp1)

vp1 <- EnhancedVolcano(significant_fold_change_genes_E18vE8, 
                       lab=rownames(significant_fold_change_genes_E18vE8), 
                       x = 'log2FoldChange', y='padj',
                       #encircle = HOX_genes,
                       #shade = HOX_genes,
                       pCutoff = 0.05, 
                       title = "E18 / E8",
                       caption = "(Figure 19) Volcano plot showing the differentially expressed genes (DEGs) in the E18 compared to E8.")
## Warning: One or more p-values is 0. Converting to 10^-1 * current lowest
## non-zero p-value...
print(vp1)

GO Analysis

Cornea vs. Retina Ontology: Molecular Function (MF)

ego <- enrichGO(gene = significant_fold_change_genes_corneaVretina@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "MF", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - Cornea vs. Retina (MF)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./corneaVretina_MF_GO_terms.csv",
#           row.names = FALSE)

(Figure 20) Volcano plot showing the differentially expressed genes (DEGs) in the cornea compared to the retina, with log2FoldChange and adjusted p-value (padj).

Considering the subontology molecular function (MF), the GO terms that were most prevalent were transmembrane signaling receptor activity, calcium ion binding, channel activity, passive transmembrane transporter activity, and monatomic ion channel activity. These terms suggest potential roles for these genes in cellular signaling, ion transport, and membrane function, which could be relevant to processes occurring in the cornea and retina.

Cornea vs. Retina Ontology: Cellular Component (CC)

ego <- enrichGO(gene = significant_fold_change_genes_corneaVretina@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "CC", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - Cornea vs. Retina (CC)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./corneaVretina_CC_GO_terms.csv",
#           row.names = FALSE)

(Figure 21) Volcano plot showing the differentially expressed genes (DEGs) in the cornea compared to the retina, with log2FoldChange and adjusted p-value (padj).

Considering the subontology cellular component (CC), the GO terms that were most prevalent were plasma membrane, extracellular region, cell projection, plasma membrane bounded cell projection, and cell junction. These terms suggest potential roles for these genes in cell adhesion, membrane organization, and intercellular communication, which could be relevant to the structural and functional differences between the cornea and retina.

Cornea vs. Retina Ontology: Biological Process (BP)

ego <- enrichGO(gene = significant_fold_change_genes_corneaVretina@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "BP", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - Cornea vs. Retina (BP)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./corneaVretina_BP_GO_terms.csv",
#           row.names = FALSE)

(Figure 22) Volcano plot showing the differentially expressed genes (DEGs) in the cornea compared to the retina, with log2FoldChange and adjusted p-value (padj).

Considering the biological processes (BP), the GO terms that were most prevalent were multicellular organism development, cell differentiation, cellular developmental process, system development, and animal organ development. These terms suggest potential roles for these genes in embryonic development, tissue differentiation, and organogenesis, which could be relevant to the developmental differences between the cornea and retina.

E16 vs. E8 Ontology: Molecular Function (MF)

ego <- enrichGO(gene = significant_fold_change_genes_E16vE8@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "MF", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - E16 vs. E8 (MF)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./E16vE8_MF_GO_terms.csv",
#           row.names = FALSE)

(Figure 23) Volcano plot showing the differentially expressed genes (DEGs) between embryonic day 16 (E16) and embryonic day 8 (E8), with log2FoldChange and adjusted p-value (padj).

Considering the molecular function (MF), some of the GO terms that were most prevalent were signaling receptor activity, molecular transducer activity, transmembrane signaling receptor activity, and channel activity. These terms suggest potential roles for these genes in cellular signaling, sensory perception, and ion transport, which could be relevant to the developmental differences between E16 and E8.

E16 vs. E8 Ontology: Cellular Component (CC)

ego <- enrichGO(gene = significant_fold_change_genes_E16vE8@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "CC", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - E16 vs. E8 (CC)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./E16vE8_CC_GO_terms.csv",
#           row.names = FALSE)

(Figure 24) Volcano plot showing the differentially expressed genes (DEGs) between embryonic day 16 (E16) and embryonic day 8 (E8), with log2FoldChange and adjusted p-value (padj).

Considering the cellular component (CC), some of the GO terms that were most prevalent were plasma membrane, cell junction, neuron projection, plasma membrane region, and plasma membrane protein complex. These terms suggest potential roles for these genes in cell adhesion, membrane organization, and neuronal development, which could be relevant to the differences between E16 and E8.

E16 vs. E8 Ontology: Biological Process (BP)

ego <- enrichGO(gene = significant_fold_change_genes_E16vE8@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "BP", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - E16 vs. E8 (BP)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./E16vE8_BP_GO_terms.csv",
#           row.names = FALSE)

(Figure 25) Volcano plot showing the differentially expressed genes (DEGs) between embryonic day 16 (E16) and embryonic day 8 (E8), with log2FoldChange and adjusted p-value (padj).

Considering the biological processes (BP), the GO terms that were most prevalent were multicellular organism development, system development, nervous system development, cell-cell signaling, and system process. These terms suggest potential roles for these genes in embryonic development, organogenesis, and neuronal differentiation, which could be relevant to the differences between E16 and E8.

E18 vs. E8 Ontology: Molecular Function (MF)

ego <- enrichGO(gene = significant_fold_change_genes_E18vE8@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "MF", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - E18 vs. E8 (MF)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./E18vE8_MF_GO_terms.csv",
#           row.names = FALSE)

(Figure 26) Volcano plot showing the differentially expressed genes (DEGs) between embryonic day 18 (E18) and embryonic day 8 (E8), with log2FoldChange and adjusted p-value (padj).”

Considering the molecular function (MF), some of the GO terms that were most prevalent were signaling receptor activity, molecular transducer activity, transmembrane signaling receptor activity, and channel activity. These terms suggest potential roles for these genes in cellular signaling, sensory perception, and ion transport, which could be relevant to the developmental differences between E18 and E8.

E18 vs. E8 Ontology: Cellular Component (CC)

ego <- enrichGO(gene = significant_fold_change_genes_E18vE8@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "CC", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - E18 vs. E8 (CC)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./E18vE8_CC_GO_terms.csv",
#           row.names = FALSE)

(Figure 27) Volcano plot showing the differentially expressed genes (DEGs) between embryonic day 18 (E18) and embryonic day 8 (E8), with log2FoldChange and adjusted p-value (padj).

Considering the cellular component (CC), some of the GO terms that were most prevalent were plasma membrane, cell projection, plasma membrane bounded cell projection, cell junction, neuron projection, etc. These terms suggest potential roles for these genes in cell adhesion, membrane organization, and neuronal development, which could be relevant to the differences between E18 and E8.

E18 vs. E8 Ontology: Biological Process (BP)

ego <- enrichGO(gene = significant_fold_change_genes_E18vE8@rownames,
                OrgDb = org.Gg.eg.db,
                keyType = "SYMBOL",
                ont = "BP", # MF, CC, BP
                pAdjustMethod = "BH",
                qvalueCutoff = 0.05,
                readable = TRUE)

dotplot(ego, title = "GO Enrichment Analysis - E18 vs. E8 (BP)") + ggplot2::theme_minimal()

# ego %>% as.data.frame()
# write.csv(ego,
#           file = "./E18vE8_BP_GO_terms.csv",
#           row.names = FALSE)

(Figure 28) Volcano plot showing the differentially expressed genes (DEGs) between embryonic day 18 (E18) and embryonic day 8 (E8), with log2FoldChange and adjusted p-value (padj).

Considering the biological processes (BP), the GO terms that were most prevalent were multicellular organism development, system development, anatomical structure morphogenesis, nervous system development, regulation of cell population proliferation, etc. These terms suggest potential roles for these genes in embryonic development, organogenesis, and neuronal differentiation, which could be relevant to the differences between E18 and E8.

The GO term analyses for all three comparisons revealed consistent patterns, with significant genes being predominantly involved in developmental processes such as multicellular organism development, system development, and nervous system development. These findings align with the embryonic RNA data from the retina and cornea of a chicken at time points E8, E16, and E18. This correlation supports the significance of the five HOX genes and the Homeobox genes, which play crucial roles in the development of the eye.

Discussion

One evident issue in my analysis pertains to code efficiency. While functional, some of my code, particularly the alignment code, could have been optimized for speed and robustness. My tendency to write R code in a more intricate manner, although it may not always be the most straightforward approach, enhances the versatility and functionality of my codebase.

Another challenge I encountered was obtaining accurate information about genes in the chicken genome. The chicken genome appears to be less extensively studied compared to organisms like humans or mice, making it difficult to find comprehensive data. This limitation hindered my ability to fully explore gene functions and pathways specific to chicken biology.

Additionally, a limitation arises from the selection of Homeobox genes. While I focused on 24 genes known to be expressed in the retina, the vast array of Homeobox genes presents the possibility of overlooking important candidates or encountering off-target effects. This challenge underscores the need for careful consideration in gene selection to maintain alignment with the project’s hypothesis and objectives. Balancing specificity with comprehensiveness is crucial to avoid deviating too far from my research focus.

Key Data Sets

File Location Description
top10_DEGs_corneaVretina.csv key_data_sets CSV file containing top ten DEGs for cornea vs retina comparison.
top10_DEGs_E16vE8.csv key_data_sets CSV file containing top ten DEGs for E16 vs E8 comparison.
top10_DEGs_E18vE8.csv key_data_sets CSV file containing top ten DEGs for E18 vs E8 comparison.
corneaVretina_BP_GO_terms.csv key_data_sets CSV file containing Biological Process Gene Ontology (GO) terms for cornea vs retina comparison.
corneaVretina_CC_GO_terms.csv key_data_sets CSV file containing Cellular Component Gene Ontology (GO) terms for cornea vs retina comparison.
corneaVretina_MF_GO_terms.csv key_data_sets CSV file containing Molecular Function Gene Ontology (GO) terms for cornea vs retina comparison.
E16vE8_BP_GO_terms.csv key_data_sets CSV file containing Biological Process Gene Ontology (GO) terms for E16 vs E8 comparison.
E16vE8_CC_GO_terms.csv key_data_sets CSV file containing Cellular Component Gene Ontology (GO) terms for E16 vs E8 comparison.
E16vE8_MF_GO_terms.csv key_data_sets CSV file containing Molecular Function Gene Ontology (GO) terms for E16 vs E8 comparison.
E18vE8_BP_GO_terms.csv key_data_sets CSV file containing Biological Process Gene Ontology (GO) terms for E18 vs E8 comparison.
E18vE8_CC_GO_terms.csv key_data_sets CSV file containing Cellular Component Gene Ontology (GO) terms for E18 vs E8 comparison.
E18vE8_MF_GO_terms.csv key_data_sets CSV file containing Molecular Function Gene Ontology (GO) terms for E18 vs E8 comparison.

References

  1. Langouet-Astrie, C., Meinsen, A., Grunwald, E. et al. RNA sequencing analysis of the developing chicken retina. Sci Data 3, 160117 (2016). https://doi.org/10.1038/sdata.2016.117

  2. Jamie L. Zagozewski, Qi Zhang, Vanessa I. Pinto, Jeffrey T. Wigle, David D. Eisenstat, The role of homeobox genes in retinal development and disease, Developmental Biology, Volume 393, Issue 2, 2014, Pages 195-208, SSN 0012-1606, https://doi.org/10.1016/j.ydbio.2014.07.004.

  3. Bi, L., & Lwigale, P. (2019). Transcriptomic analysis of differential gene expression during chick periocular neural crest differentiation into corneal cells. Developmental dynamics : an official publication of the American Association of Anatomists, 248(7), 583–602. https://doi.org/10.1002/dvdy.43

  4. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, Stein TI, Nudel R, Lieder I, Mazor Y, Kaplan S, Dahary D, Warshawsky D, Guan-Golan Y, Kohn A, Rappaport N, Safran M, Lancet D. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Current Protocols in Bioinformatics. 2016;54:1.30.1-1.30.33. doi: 10.1002/cpbi.5. PMID: 27603008.

  5. Nagai H, Sezaki M, Bertocchini F, Fukuda K, Sheng G. HINTW, a W-chromosome HINT gene in chick, is expressed ubiquitously and is a robust female cell marker applicable in intraspecific chimera studies. Genesis. 2014 May;52(5):424-30. doi: 10.1002/dvg.22769. Epub 2014 Mar 13. PMID: 24599776.

  6. Luo, H., Zhou, H., Jiang, S., He, C., Xu, K., Ding, J., Liu, J., Qin, C., Chen, K., Zhou, W., Wang, L., Yang, W., Zhu, W., & Meng, H. (2023). Gene Expression Profiling Reveals Potential Players of Sex Determination and Asymmetrical Development in Chicken Embryo Gonads. International journal of molecular sciences, 24(19), 14597. https://doi.org/10.3390/ijms241914597